!pip install plotly==5.6.0
Requirement already satisfied: plotly==5.6.0 in c:\users\caevi\anaconda3\lib\site-packages (5.6.0) Requirement already satisfied: six in c:\users\caevi\anaconda3\lib\site-packages (from plotly==5.6.0) (1.16.0) Requirement already satisfied: tenacity>=6.2.0 in c:\users\caevi\anaconda3\lib\site-packages (from plotly==5.6.0) (8.0.1)
# import basic libraries
import pandas as pd
from random import random
import numpy as np
from scipy.ndimage import gaussian_filter1d # the scipy.ndimage.filters namespace is deprecated
from datetime import datetime
# data visualization libraries
import matplotlib.pyplot as plt # matplotlib for basic plotting
import seaborn as sns # seaborn for simplified/prettified plots
sns.set_theme() # apply the default theme
sns.set(rc = {'figure.figsize':(15,8)}) # apply larger size
import plotly.express as px # plotly for interactive visualizations
# read in the data
df = pd.read_csv('../data/wiid.csv')
df.head()
| | id | country | c3 | c2 | year | gini_reported | q1 | q2 | q3 | q4 | ... | median_usd | gdp_ppp_pc_usd2011 | population | revision | quality | quality_score | source | source_detailed | source_comments | survey |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Afghanistan | AFG | AF | 2008 | 29.00 | 9.00 | 13.00 | 17.00 | 22.00 | ... | NaN | 1298.0 | 27294031.0 | New 2013 | High | 12 | National statistical authority | European Commission and the Government of Afgh... | National Risk and Vulnerability Assessment | NaN |
| 1 | 2 | Albania | ALB | AL | 1996 | 27.01 | 9.15 | 13.70 | 17.73 | 23.29 | ... | 1982.0 | 4812.0 | 3092228.0 | New 2018 | Average | 13 | World Bank | World Bank 2018 | PovcalNet | NaN |
| 2 | 3 | Albania | ALB | AL | 2002 | 31.74 | 8.35 | 12.58 | 16.49 | 22.21 | ... | 1902.0 | 6316.0 | 3119029.0 | New 2018 | Average | 13 | World Bank | World Bank 2018 | PovcalNet | NaN |
| 3 | 4 | Albania | ALB | AL | 2005 | 30.60 | 8.40 | 12.90 | 17.03 | 22.50 | ... | 2217.0 | 7563.0 | 3079179.0 | New 2018 | Average | 13 | World Bank | World Bank 2018 | PovcalNet | NaN |
| 4 | 5 | Albania | ALB | AL | 2008 | 29.98 | 8.87 | 13.07 | 16.83 | 22.23 | ... | 2385.0 | 9018.0 | 2991651.0 | New 2018 | Average | 13 | World Bank | World Bank 2018 | PovcalNet | NaN |
5 rows × 55 columns
This dataset compiles economic statistics on most countries from multiple sources, including the World Bank, regional databases like SEDLAC (Socio-Economic Database for Latin America & the Caribbean), UNECLAC, and EDLAC, academic papers, and more. It contains about 10,000 rows of non-null data for each column, and covers consumption, income, earnings, inequality (Gini coefficient), and more.
To start, we will plot the data on a world map to visualize the differences in population and GDP per capita in different parts of the world.
geo_df = df.copy()
geo_df = geo_df[geo_df['population'].notna()] # drop NA values
geo_df = geo_df.sort_values('year', ascending=False)
fig = px.scatter_geo(geo_df, locations="country", locationmode="country names",
color="gdp_ppp_pc_usd2011",
hover_name="country", size="population",
animation_frame="year",
projection="natural earth",
title="Animated Geovisualization of GDP Per Capita and Population Over Time")
fig.show()
The animated visualization above is useful for understanding the differences in GDP per capita across the world. The brighter colors show that regions like the US and Canada, Europe, Japan, Korea, and Australia have significantly higher GDPs per capita than most of the rest of the world. However, the size of the bubbles shows that regions like China, India, and Indonesia comprise a far greater proportion of global population. Finally, the time slider at the bottom of the graph can be used to animate the data over time. This also allows us to see changes in the availability of data over time (e.g. GDP per capita data is not available until the 1990s, and Africa is very underrepresented in this dataset), revealing how this dataset is biased or flawed. The animation can also be used to visualize trends and changes in GDP and population over time.
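The gaps in coverage mentioned above can also be checked numerically. The sketch below counts how many distinct countries report a GDP per capita value in each year. The helper name `coverage_by_year` and the toy rows are illustrative only and are not taken from the WIID file.

```python
import pandas as pd

def coverage_by_year(frame, column="gdp_ppp_pc_usd2011"):
    """Count distinct countries with a non-null value of `column` in each year."""
    have_value = frame[frame[column].notna()]
    return have_value.groupby("year")["country"].nunique()

# Made-up rows, only to show the shape of the result
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "year": [1985, 1995, 1995, 1995, 1995],
    "gdp_ppp_pc_usd2011": [None, 5000.0, 7000.0, 7100.0, None],
})
coverage = coverage_by_year(toy)
print(coverage)
```

On the real data this would be called as `coverage_by_year(df)`. Years with no GDP values at all simply do not appear in the result.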
Economic conditions vary dramatically depending on where in the world a country is located. It is common knowledge that consumers in regions like the Americas and Europe spend more than consumers in regions like Sub-Saharan Africa. In this part, I visualize the data to test this assumption and determine which regions make up a larger share of global household consumption.
# determine the quality of data in this dataset
df.quality.value_counts()
High 5997 Average 3086 Low 1812 Not known 206 Name: quality, dtype: int64
The data includes a column labeling the estimated quality of each observation. I considered filtering out low-quality or unknown-quality observations, but based on the counts above this would remove about 18% of the observations (2,018 of 11,101). It would also bias the resulting data toward higher-income countries where better data is available. Thus, I did not filter the data by quality.
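The share of rows that a quality filter would remove can be computed directly from the label counts. This is a minimal sketch: `share_dropped` is a hypothetical helper, and the label column is rebuilt from the `value_counts()` output printed above instead of being read from the CSV.

```python
import pandas as pd

def share_dropped(quality, drop_labels=("Low", "Not known")):
    """Return the fraction of rows whose quality label is in drop_labels."""
    return quality.isin(list(drop_labels)).mean()

# Rebuild a label column from the value counts printed above
counts = {"High": 5997, "Average": 3086, "Low": 1812, "Not known": 206}
quality = pd.Series([label for label, n in counts.items() for _ in range(n)])

dropped = share_dropped(quality)
print(f"{dropped:.1%} of rows would be removed")
```

With the real frame, the equivalent call would be `share_dropped(df['quality'])`.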
# subset the data to rows reporting household consumption per capita
df_cons = df[(df['resource'] == 'Consumption') &
(df['scale'] == 'Per capita') &
(df['sharing_unit'] == 'Household')]
df_cons.shape # reduces df size from 11101 to 1239 rows
# also subset to specific year - 2010 - to avoid duplicate values from years
# (filter df_cons, not df, so the consumption subset above is kept)
df_cons = df_cons[df_cons['year'] == 2010]
df_cons[['mean']].isna().sum() # find number of NA values in mean
df_cons = df_cons[df_cons['mean'].notna()] # drop NA values
# groupby region and sum to get the total average household consumption of the region
avg_cons = df_cons.groupby('region_un').agg({'mean':'sum'}).reset_index()
fig = px.pie(avg_cons, values='mean', names='region_un', title='Total Average Household Consumption per Capita by Region')
fig.show()
This chart was created by summing the mean household consumption per capita values within each region to get a regional total. It shows that Asia dominates global household consumption, making up a large share of the total along with the Americas and Europe. However, this should be taken with a grain of salt, as Asia also has most of the world's population. See the graph below.
# plot global population using plotly
avg_pop = df_cons.groupby('region_un').agg({'population':'sum'}).reset_index()
fig = px.pie(avg_pop, values='population', names='region_un', title='Global Population Composition (in this Dataset)')
fig.show()
# plt.pie(avg_pop['population'], labels = avg_pop['region_un'], autopct='%.0f%%', rotatelabels=True)
# plt.show()
The visualizations above support the conclusion that Asia makes up the largest share of global consumption, while the Americas and Europe make up most of the remainder. However, when accounting for population, Asia's share of consumption no longer seems so dramatic. Also, this dataset is likely somewhat biased: it includes less data from Africa and Oceania, so these regions show lower populations than their actual values.
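To put the two pie charts side by side, each region's share of summed consumption can be compared with its share of population in one table. The sketch below is one possible way to do this, not the method used for the charts above. It assumes `population` is a country-level figure repeated on each row, so it keeps one row per country before summing. The helper name and the toy values are made up.

```python
import pandas as pd

def regional_shares(frame):
    """Compare each region's share of summed 'mean' with its share of population."""
    # Collapse to one row per country so repeated surveys do not double-count population
    per_country = (frame.groupby(["region_un", "country"], as_index=False)
                        .agg(mean=("mean", "mean"), population=("population", "first")))
    by_region = per_country.groupby("region_un")[["mean", "population"]].sum()
    shares = by_region / by_region.sum()  # divide each column by its total
    shares.columns = ["consumption_share", "population_share"]
    return shares

# Made-up values, for illustration only
toy = pd.DataFrame({
    "region_un": ["Asia", "Asia", "Asia", "Europe"],
    "country": ["A", "A", "B", "C"],
    "mean": [100.0, 200.0, 50.0, 300.0],
    "population": [10.0, 10.0, 30.0, 20.0],
})
shares = regional_shares(toy)
print(shares)
```

A region whose consumption share is well below its population share consumes less per person than the dataset average, and vice versa.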
The Gini coefficient is a number meant to represent inequality within a group (like a nation). Here, I try to determine how the Gini coefficient varies over time, both globally and for specific groups of countries.
sns.set(rc = {'figure.figsize':(15,8)})
(sns.lineplot(data=df, x="year", y="gini_reported")
.set(title='Global Gini index over time'))
[Text(0.5, 1.0, 'Global Gini index over time')]
This graph shows that collection of Gini data began in earnest around 1940. From there, global inequality as measured by Gini coefficients increased, peaking around 1960, and then declined from the 1970s to the late 1980s. From 2000 to 2020, it stagnated at a Gini coefficient of about 40. This is a global average, however, and inequality in some regions has likely increased (e.g. in the US). See below.
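When several rows share the same `year`, `sns.lineplot` by default draws the mean of `gini_reported` for that year with a confidence band around it. The sketch below reproduces that yearly mean and also reports how many observations sit behind each point, which helps judge how reliable the early years are. The helper name and toy rows are illustrative assumptions.

```python
import pandas as pd

def yearly_gini_summary(frame):
    """Return the unweighted mean Gini and the number of observations per year."""
    return (frame.dropna(subset=["gini_reported"])
                 .groupby("year")["gini_reported"]
                 .agg(["mean", "count"]))

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "year": [1960, 1960, 1961],
    "gini_reported": [40.0, 50.0, None],
})
summary = yearly_gini_summary(toy)
print(summary)
```

Note that this is an unweighted average over survey rows, so countries with many surveys in a year count more than countries with one.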
# subset the data to countries of interest
countries_of_interest = ['United States', 'Germany', 'China', 'Russia', 'India', 'Brazil', 'South Africa', 'Indonesia']
df_interest = df[df['country'].isin(countries_of_interest)]
df_interest = df_interest[df_interest['year'] > 1950] # use only data from after 1950
# plot this data
(sns.lineplot(data=df_interest, x="year", y="gini_reported", hue="country")
.set(title='Gini index in Countries of Interest Over Time'))
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
This chart repeats the one above for a specific group of countries of interest. Inequality as measured by the Gini coefficient is especially high in Brazil, but has been declining there since 2000. In contrast, it is extremely high in South Africa and shows no sign of declining. The United States has hovered around a Gini coefficient of 40 since the 1950s, but in the last 10 years inequality has increased dramatically. China has experienced a significant swing in inequality, with its Gini coefficient decreasing to the extremely low value of 20 around the 1980s, but then increasing again to be on par with the United States by the 2020s. This may reflect privatization and the rise of state capitalism in China. Another notable feature of this chart is the skyrocketing inequality in Russia after the collapse of the Soviet Union in the 1990s.
# plot a line graph of Gini coefficients over time by income group
gini_df = df.copy()
gini_df = gini_df[gini_df['year'] > 1950]
# construct a rolling average over every 3 consecutive rows (for data smoothing; rows are not grouped by year or country)
gini_df['gini_avg'] = gini_df['gini_reported'].rolling(3).mean()
# plot the graph by income group
(sns.lineplot(data=gini_df, x="year", y="gini_avg", hue="incomegroup")
.set(title='Gini index by Income Group Over Time'))
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
Finally, the chart above shows how the Gini coefficients of countries in different income groups have changed over time. Somewhat surprisingly, high-income countries show the lowest average Gini coefficient, using a 3-year moving average. Upper middle income countries show the highest inequality, while low-income countries show extreme variability until the 2010s, when they reach a more stable Gini coefficient, averaging between 40 and 50.
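A moving average taken over the raw rows mixes neighbouring countries and income groups. One alternative, sketched here under the assumption that a per-group yearly mean is the quantity of interest, is to average within each income group and year first and then smooth within each group. The window covers 3 consecutive available years, which are not necessarily 3 calendar years if a group has gaps. The helper name and toy values are made up.

```python
import pandas as pd

def smoothed_gini_by_group(frame, window=3):
    """Average Gini per income group and year, then smooth within each group."""
    yearly = (frame.groupby(["incomegroup", "year"], as_index=False)["gini_reported"]
                   .mean()
                   .sort_values(["incomegroup", "year"]))
    # Rolling mean restarts for every income group, so groups never mix
    yearly["gini_avg"] = (yearly.groupby("incomegroup")["gini_reported"]
                                .transform(lambda s: s.rolling(window, min_periods=1).mean()))
    return yearly

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "incomegroup": ["High", "High", "High", "High", "Low", "Low"],
    "year": [2000, 2000, 2001, 2002, 2000, 2001],
    "gini_reported": [30.0, 40.0, 35.0, 38.0, 50.0, 60.0],
})
smoothed = smoothed_gini_by_group(toy)
print(smoothed)
```

The result could be passed to `sns.lineplot(data=smoothed, x="year", y="gini_avg", hue="incomegroup")` in place of the row-wise rolling column.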
Here, we attempt to determine how inequality as measured by the Gini coefficient is related to GDP and population.
# create plotly graph of GDP over time
px_df = df.copy()
px_df = px_df[px_df['year'] > 1990] # filter to after 1990, when GDP data started
px_df = px_df[px_df['population'] > 50000000] # filter to countries w/ population over 50 million
fig = px.line(px_df, x="year", y='gdp_ppp_pc_usd2011', color='country', title='National Purchasing Power Parity Per Capita GDP in 2011 Dollars over Time')
fig.show()
Most countries show increasing GDP per capita since the 1990s. The US has an especially high GDP per capita. A slight decrease in GDP per capita is also visible during the 2008-2009 global financial crisis and recession. The most significant and consistent increase is in China, whose GDP per capita in 2011 dollars rose from 3,922 in 2005 to 14,146 in 2016, a roughly 3.6x increase.
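The size of China's increase can be checked with a short calculation using the two values quoted above. The annualized rate is an extra figure derived here for context, assuming steady compound growth between the two years. The helper name is hypothetical.

```python
def growth_summary(start_value, end_value, start_year, end_year):
    """Return the overall growth factor and the implied compound annual growth rate."""
    factor = end_value / start_value
    years = end_year - start_year
    annual_rate = factor ** (1 / years) - 1
    return factor, annual_rate

# GDP per capita values for China quoted in the text (2011 dollars)
factor, annual_rate = growth_summary(3922, 14146, 2005, 2016)
print(f"growth factor: {factor:.2f}x, annualized: {annual_rate:.1%}")
```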
fig = px.scatter_3d(px_df, x='year', y='gdp_ppp_pc_usd2011', z='gini_reported',
color='country', title='3D Scatterplot of Gini Index and GDP Over Time')
fig.show()
The scatterplot is hard to interpret, but it does reveal some interesting patterns. South Africa forms a notable cluster, with a much higher Gini coefficient than other countries in the same period with comparable GDP per capita. Rotating the graph so that GDP per capita is in the foreground and the Gini index in the background shows no overwhelmingly clear correlation between GDP and inequality. However, there does seem to be a rough negative relationship, with the Gini index declining as GDP per capita increases. Countries like Brazil, Mexico, and South Africa have very high Gini coefficients for their GDP, and all have relatively low GDPs per capita. The interactive plot also allows more open-ended exploration of specific years, countries, and Gini coefficients.
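The visual impression of a rough negative relationship can be backed by a correlation coefficient. The sketch below computes a Pearson correlation and a rank-based (Spearman-style) correlation between GDP per capita and the Gini index. The helper name and toy values are assumptions for illustration, and the numbers it prints say nothing about the real data.

```python
import pandas as pd

def gdp_gini_correlation(frame):
    """Return (Pearson, rank-based) correlation between GDP per capita and Gini."""
    pairs = frame[["gdp_ppp_pc_usd2011", "gini_reported"]].dropna()
    gdp, gini = pairs["gdp_ppp_pc_usd2011"], pairs["gini_reported"]
    pearson = gdp.corr(gini)
    # Correlating the ranks gives a Spearman-style coefficient
    spearman = gdp.rank().corr(gini.rank())
    return pearson, spearman

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "gdp_ppp_pc_usd2011": [1000.0, 2000.0, 4000.0, 8000.0, None],
    "gini_reported": [55.0, 50.0, 42.0, 40.0, 30.0],
})
pearson, spearman = gdp_gini_correlation(toy)
print(f"Pearson: {pearson:.2f}, rank-based: {spearman:.2f}")
```

On the filtered frame this would be `gdp_gini_correlation(px_df)`. Because each country contributes many years, the rows are not independent, so the coefficient is descriptive only.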
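# population is a country-year figure repeated on each survey row, so average it instead of summing
avg_gini = (px_df.groupby(['country', 'year'])
.agg({'gini_reported':'mean', 'gdp_ppp_pc_usd2011':'mean', 'population':'mean'}).reset_index())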
fig = px.scatter(avg_gini, x="year", y="gini_reported", size="population", color="gdp_ppp_pc_usd2011",
hover_name="country", size_max=60, title="Gini coefficients over time, with population and GDP per capita")
fig.show()
This chart allows us to see some trends more clearly. The line of the US is much brighter and more visible because of its higher GDP per capita, and we can see that US inequality has increased steadily, with 2017 as an outlier high-inequality year. All countries seem to be converging somewhat, toward Gini coefficients in the 40-50 range. The world's overall GDP per capita is visible here in brighter colors in later years, with China's increasing GDP per capita especially visible. Brazil's decreasing Gini coefficient and China's increasing inequality are also evident.